library(fpp2)
## Warning in register(): Can't find generic `scale_type` in package ggplot2 to
## register S3 method.
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
## ── Attaching packages ────────────────────────────────────────────── fpp2 2.4 ──
## ✓ ggplot2 3.3.5 ✓ fma 2.4
## ✓ forecast 8.16 ✓ expsmooth 2.3
##
library(forecast)
library(urca)
library(lmtest) # grangertest(), used in Question 2
library(vars)   # VAR() and VARselect(), used in Question 2
autoplot(ma(usmelec,12,centre = TRUE))
autoplot(usmelec)
Generally, total electricity use has increased over the covered time frame. The 12-month moving average of net electricity generation rises roughly linearly between 1973 and 2007. From 2007 to 2013, the data deviates from this linear trend and plateaus.
boxcox_usmelec <- BoxCox(usmelec,BoxCox.lambda(usmelec))
autoplot(boxcox_usmelec)
autoplot(mstl(boxcox_usmelec))
From part (a), we can see that the variance of the series is not constant and increases with the level. This is indicative of a multiplicative model, so a Box-Cox transformation should be employed. We also opted to fit ARIMA models directly to the Box-Cox-transformed data rather than to the seasonally adjusted component of an mstl decomposition, since the seasonal component is almost negligible compared to the trend, as seen in the decomposition plot.
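As a self-contained sketch of what the transformation does (the function and lambda values here are illustrative, not the fitted BoxCox.lambda value; forecast::BoxCox additionally handles negative data and bias adjustment):

```r
# Minimal base-R sketch of the Box-Cox transform (illustrative only).
box_cox <- function(y, lambda) {
  if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
}
# Large values are compressed more than small ones, stabilising variance:
box_cox(c(1, 100), lambda = 0.5)  # 0 and 18
box_cox(2, lambda = 0)            # log(2)
```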
boxcox_usmelec %>%
ur.kpss()%>%
summary()
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 5 lags.
##
## Value of test-statistic is: 7.8337
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
nsdiffs(boxcox_usmelec) #1
## [1] 1
ndiffs(diff(boxcox_usmelec,12)) #1
## [1] 1
boxcox_usmelec%>%
diff(lag = 12) %>%
diff() %>%
ur.kpss()%>%
summary()
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 5 lags.
##
## Value of test-statistic is: 0.0167
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
acf(boxcox_usmelec)
acf(diff(diff(boxcox_usmelec,12)))
From the KPSS test, we can see that the data is not stationary: the test statistic of 7.8337 exceeds even the 1% critical value of 0.739. We therefore check how much differencing is needed using nsdiffs (one seasonal difference) and ndiffs (one further first difference). After differencing, the KPSS test statistic drops to 0.0167, below the 10% critical value of 0.347, so we do not reject stationarity.
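A quick base-R sketch of the same double differencing, on a synthetic monthly series rather than usmelec:

```r
# Seasonal difference at lag 12 followed by a first difference, as applied
# to the Box-Cox-transformed series above (synthetic data for illustration).
set.seed(42)
y <- ts(cumsum(rnorm(120)) + rep(5 * sin(2 * pi * (1:12) / 12), 10),
        frequency = 12)
d <- diff(diff(y, lag = 12))
length(y) - length(d)  # 13 observations lost: 12 (seasonal) + 1 (first diff)
```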
ggtsdisplay(boxcox_usmelec)
ggtsdisplay(diff(diff(boxcox_usmelec,12)))
Based on the PACF plot of the differenced transformed data, there may be both seasonal and non-seasonal MA components, given the exponentially decaying PACF values at the seasonal and non-seasonal lags respectively. With reference to the ACF plot, the seasonal MA order could be 1 (spike at lag 12) and the non-seasonal MA order could be 3 (three significant spikes in the ACF).
For AR, it is difficult to identify the order, both non-seasonal and seasonal, as the PACF plot could be affected by the MA portion. From the ACF plot, we assume it is not a pure MA model, since there are significant spikes outside the relevant seasonal and non-seasonal lags. From the PACF, the seasonal AR order could be 3 (significant spikes at the seasonal lags) and the non-seasonal AR order could be 2 (significant spikes in the first two lags). Either could also be 0 or 1, since the PACF spikes might be a result of the MA component instead.
Since it is difficult to pin down the orders from the ACF and PACF plots alone, we test a few models by varying the orders.
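The same fit-and-compare pattern can be sketched with base R's arima() on a simulated ARMA series; the orders and data here are illustrative, not the usmelec fits below:

```r
# Sketch: fit a few candidate orders and rank by AIC (simulated data; the
# real models m1-m7 do the same with forecast::Arima on usmelec).
set.seed(1)
y <- arima.sim(list(ar = 0.6, ma = 0.3), n = 200)
orders <- list(c(1, 0, 0), c(0, 0, 1), c(1, 0, 1), c(2, 0, 2))
aics <- sapply(orders, function(o) AIC(arima(y, order = o)))
data.frame(order = sapply(orders, paste, collapse = ","), AIC = aics)
```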
m1 <- auto.arima(usmelec, lambda = "auto", biasadj = T) #(1,1,3)(2,1,1)[12]
m2 <- Arima(usmelec, order = c(2,1,3), seasonal = c(3,1,1), lambda = "auto", biasadj = T)
m3 <- Arima(usmelec, order = c(0,1,3), seasonal = c(2,1,1), lambda = "auto", biasadj = T)
m4 <- Arima(usmelec, order = c(1,1,3), seasonal = c(2,1,2), lambda = "auto", biasadj = T)
m5 <- Arima(usmelec, order = c(1,1,2), seasonal = c(2,1,1), lambda = "auto", biasadj = T)
m6 <- Arima(usmelec, order = c(1,1,1), seasonal = c(2,1,1), lambda = "auto", biasadj = T)
m7 <- Arima(usmelec, order = c(1,1,1), seasonal = c(3,1,1), lambda = "auto", biasadj = T)
model_AIC = data.frame(model = c("m1", "m2", "m3", "m4", "m5", "m6", "m7"),
AIC = c(m1$aic, m2$aic, m3$aic, m4$aic, m5$aic, m6$aic, m7$aic))
model_AIC
## model AIC
## 1 m1 -5079.415
## 2 m2 -5077.322
## 3 m3 -5080.715
## 4 m4 -5079.223
## 5 m5 -5081.264
## 6 m6 -5082.634
## 7 m7 -5082.075
The best model by AIC is m6, with the lowest value of -5082.634 among the candidates.
checkresiduals(m1)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,1,3)(2,1,1)[12]
## Q* = 25.995, df = 17, p-value = 0.07454
##
## Model df: 7. Total lags used: 24
checkresiduals(m3)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,1,3)(2,1,1)[12]
## Q* = 27.208, df = 18, p-value = 0.07518
##
## Model df: 6. Total lags used: 24
checkresiduals(m5)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,1,2)(2,1,1)[12]
## Q* = 26.61, df = 18, p-value = 0.08662
##
## Model df: 6. Total lags used: 24
checkresiduals(m6)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,1,1)(2,1,1)[12]
## Q* = 28.013, df = 19, p-value = 0.08318
##
## Model df: 5. Total lags used: 24
Since m3 and m5 have AIC values close to m6's, we consider them as well.
From checkresiduals, all four models pass the Ljung-Box test, with p-values of 0.07454, 0.07518, 0.08662, and 0.08318 respectively. At the 5% significance level we therefore do not reject the null hypothesis that no time-series information remains: the residuals of all models resemble white noise.
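As a self-contained illustration of the test that checkresiduals() runs, here is Box.test() on simulated white noise; for real ARIMA residuals one would also set fitdf to the number of estimated parameters so the degrees of freedom are adjusted:

```r
# Ljung-Box test on simulated white noise (illustrative, not model residuals).
set.seed(1)
e <- rnorm(200)
bt <- Box.test(e, lag = 24, type = "Ljung-Box")
bt$p.value  # a large p-value: fail to reject "no autocorrelation"
```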
latest_data <- read.csv("HW2 Q1 Cleaned Data.csv")
latest_data_ts <- ts(latest_data["Amount"], start = c(2013,7), deltat = 1/12)
accuracy(forecast(m1,h = 15*12), latest_data_ts)
## ME RMSE MAE MPE MAPE MASE
## Training set -0.6538782 7.25007 5.323888 -0.2513926 2.008707 0.5912316
## Test set -8.8457741 14.41367 11.697260 -2.6467902 3.474309 1.2990110
## ACF1 Theil's U
## Training set -0.03792581 NA
## Test set 0.53947979 0.467195
accuracy(forecast(m3,h = 15*12), latest_data_ts)
## ME RMSE MAE MPE MAPE MASE
## Training set -0.6299816 7.256241 5.322615 -0.244362 2.007970 0.5910902
## Test set -8.7515580 14.369285 11.645734 -2.618320 3.458043 1.2932889
## ACF1 Theil's U
## Training set -0.03382085 NA
## Test set 0.54060278 0.4656346
accuracy(forecast(m5,h = 15*12), latest_data_ts)
## ME RMSE MAE MPE MAPE MASE
## Training set -0.6496675 7.253729 5.322741 -0.2502208 2.008216 0.5911042
## Test set -8.8478413 14.417654 11.699077 -2.6473445 3.474800 1.2992128
## ACF1 Theil's U
## Training set -0.03789921 NA
## Test set 0.53959140 0.4673116
accuracy(forecast(m6,h = 15*12), latest_data_ts)
## ME RMSE MAE MPE MAPE MASE
## Training set -0.6677917 7.25979 5.326065 -0.2556993 2.010183 0.5914733
## Test set -8.9267115 14.45937 11.743866 -2.6712645 3.488996 1.3041867
## ACF1 Theil's U
## Training set -0.02746183 NA
## Test set 0.53910701 0.468791
autoplot(forecast(m3,h = 15*12), PI = F)
autoplot(forecast(m3,h = 15*12), PI = T)
From the accuracy results, m3 attained the lowest RMSE on the test set (14.37, versus 14.41 to 14.46 for the others), which is what matters for forecasting performance. Hence, m3 is the best model to use for prediction.
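For reference, the RMSE that accuracy() reports is simply the root mean squared forecast error; a one-line base-R version (toy numbers, not the usmelec forecasts):

```r
# RMSE by hand, matching the definition used by forecast::accuracy().
rmse <- function(actual, pred) sqrt(mean((actual - pred)^2))
rmse(c(3, 5, 7), c(2, 5, 9))  # sqrt((1 + 0 + 4) / 3), about 1.29
```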
Q1 (g)
m3_forecast <-forecast(m3,h = 15*12)
interval_range_80 <- m3_forecast$upper[,1]-m3_forecast$lower[,1]
m3_forecast$mean/interval_range_80
## Jan Feb Mar Apr May Jun Jul
## 2013 11.105756
## 2014 8.839151 9.267178 8.921318 9.086906 8.461567 7.791046 6.972335
## 2015 6.495894 6.841631 6.641807 6.779138 6.318720 5.849071 5.377929
## 2016 5.261577 5.568315 5.424026 5.565206 5.209558 4.841808 4.446665
## 2017 4.404300 4.671157 4.556026 4.684201 4.392166 4.086615 3.762481
## 2018 3.759284 3.992980 3.899080 4.013378 3.766121 3.506485 3.237290
## 2019 3.260469 3.467633 3.389288 3.492303 3.279238 3.054966 2.825492
## 2020 2.864292 3.049765 2.983041 3.076593 2.890357 2.693823 2.494570
## 2021 2.542329 2.709736 2.652023 2.737434 2.572672 2.398399 2.223034
## 2022 2.275894 2.428066 2.377520 2.455893 2.308657 2.152609 1.996522
## 2023 2.052106 2.191299 2.146557 2.218822 2.086122 1.945222 1.804956
## 2024 1.861741 1.989781 1.949811 2.016744 1.896255 1.768102 1.641003
## 2025 1.698014 1.816398 1.780402 1.842656 1.732538 1.615219 1.499208
## 2026 1.555824 1.665795 1.633146 1.691275 1.590041 1.482010 1.375423
## 2027 1.431259 1.533862 1.504057 1.558535 1.464971 1.364957 1.266437
## 2028 1.321267 1.417391 1.390022 1.441261 1.354356 1.261299
## Aug Sep Oct Nov Dec
## 2013 9.659289 10.147321 10.259623 10.133421 9.231074
## 2014 6.758013 7.240567 7.389763 7.351133 6.732290
## 2015 5.298878 5.727332 5.884148 5.887053 5.427795
## 2016 4.388762 4.755730 4.897179 4.910321 4.535926
## 2017 3.719780 4.038444 4.165167 4.182383 3.867000
## 2018 3.206455 3.487272 3.601679 3.620935 3.350574
## 2019 2.802597 3.052687 3.156496 3.176537 2.941164
## 2020 2.477111 2.701793 2.796442 2.816545 2.608947
## 2021 2.209496 2.412913 2.499639 2.519415 2.334360
## 2022 1.985898 2.171320 2.251166 2.270407 2.103967
## 2023 1.796538 1.966584 2.040427 2.059026 1.908169
## 2024 1.634284 1.791089 1.859667 1.877582 1.739915
## 2025 1.493816 1.639143 1.703086 1.720307 1.593911
## 2026 1.371081 1.506406 1.566253 1.582795 1.466104
## 2027 1.262936 1.389510 1.445727 1.461616 1.353336
## 2028
Roughly 4 years of forecasts would be sufficiently accurate to be usable. At the 4-year horizon, the ratio of the point forecast to the width of the 80% prediction interval is still high, at approximately 4. Furthermore, the trend and seasonality of the usmelec data are relatively stable.
# Question 2
The visnights data contains total quarterly visitor nights (in millions) from 1998-2016 for twenty regions of Australia within six states: New South Wales, Queensland, South Australia, Victoria, Western Australia, and Other.
autoplot(visnights)
There does not appear to be a long term directional trend. However, there does appear to be yearly seasonality in the data.
Granger causality means that lagged values of one variable help predict another variable. It is likely that the total quarterly visitor nights in two adjacent regions within the same state would Granger-cause each other: an individual deciding to stay in a particular state has the choice of which of those regions to stay in, making them substitute goods. Hence, we have chosen NSWNthIn and NSWSthIn.
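Under the hood, grangertest() is an F test between nested lag regressions; the base-R sketch below shows the idea on simulated series (the series, lag order, and coefficients are made up for illustration):

```r
# Granger causality as a nested-model F test: does adding lagged x improve
# the prediction of y beyond y's own lag? (Simulated data, lag order 1.)
set.seed(1)
n <- 200
x <- as.numeric(arima.sim(list(ar = 0.5), n))
y <- numeric(n)
for (t in 2:n) y[t] <- 0.3 * y[t - 1] + 0.6 * x[t - 1] + rnorm(1)
restricted   <- lm(y[2:n] ~ y[1:(n - 1)])                   # y's own lag only
unrestricted <- lm(y[2:n] ~ y[1:(n - 1)] + x[1:(n - 1)])    # plus lagged x
anova(restricted, unrestricted)  # small p-value: x Granger-causes y
```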
index1 = which(colnames(visnights) == "NSWNthIn")
index2 = which(colnames(visnights) == "NSWSthIn")
autoplot(visnights[,index1])
autoplot(visnights[,index2])
# Check for whether seasonal differencing is required
nsdiffs(visnights[,index1])
## [1] 0
nsdiffs(visnights[,index2])
## [1] 0
ndiffs(visnights[,index1])
## [1] 0
ndiffs(visnights[,index2])
## [1] 0
From our analysis of nsdiffs and ndiffs, the variables are already in their stationary form, without having to do any differencing.
grangertest(visnights[,index1], visnights[,index2], order = 1)
## Granger causality test
##
## Model 1: visnights[, index2] ~ Lags(visnights[, index2], 1:1) + Lags(visnights[, index1], 1:1)
## Model 2: visnights[, index2] ~ Lags(visnights[, index2], 1:1)
## Res.Df Df F Pr(>F)
## 1 72
## 2 73 -1 17.098 9.483e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
grangertest(visnights[,index2], visnights[,index1], order = 1)
## Granger causality test
##
## Model 1: visnights[, index1] ~ Lags(visnights[, index1], 1:1) + Lags(visnights[, index2], 1:1)
## Model 2: visnights[, index1] ~ Lags(visnights[, index1], 1:1)
## Res.Df Df F Pr(>F)
## 1 72
## 2 73 -1 3.2452 0.07582 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-values from both Granger causality tests are low (below 10%). The p-value for the test of whether NSWNthIn Granger-causes NSWSthIn is 9.483e-05, and the p-value for the reverse test is 0.07582.
combined_df = cbind(visnights[,index1],visnights[,index2])
VARselect(combined_df, lag.max = 8)
## $selection
## AIC(n) HQ(n) SC(n) FPE(n)
## 4 1 1 4
##
## $criteria
## 1 2 3 4 5 6
## AIC(n) -4.3574722 -4.32594137 -4.33860191 -4.36626765 -4.29123573 -4.33197390
## HQ(n) -4.2798748 -4.19661241 -4.15754137 -4.13347552 -4.00671202 -3.99571861
## SC(n) -4.1616333 -3.99954318 -3.88164444 -3.77875090 -3.57315971 -3.48333861
## FPE(n) 0.0128122 0.01322813 0.01307389 0.01273828 0.01376672 0.01326737
## 7 8
## AIC(n) -4.27495886 -4.218295
## HQ(n) -3.88697198 -3.778576
## SC(n) -3.29576428 -3.108541
## FPE(n) 0.01411927 0.015046
var1 = VAR(combined_df, p =1, type = "const")
serial.test(var1, lags.pt = 10, type = "PT.asymptotic")
##
## Portmanteau Test (asymptotic)
##
## data: Residuals of VAR object var1
## Chi-squared = 43.103, df = 36, p-value = 0.1935
We perform a Portmanteau test on the residuals. The probability of obtaining a test statistic at least as extreme as the one observed is 0.1935. Hence, we fail to reject the null hypothesis of no serial correlation: there is indeed no remaining time-series information in the residuals.